AITopics | hard attention

Technology: Information Technology > Artificial Intelligence > Natural Language (1.00)

Neural Information Processing SystemsMar-13-2026, 12:37:50 GMT

Latent Alignment and Variational Attention

Yuntian Deng, Yoon Kim, Justin Chiu, Demi Guo, Alexander Rush

Arelated latent approach, hard attention, fixesthese issues, butisgenerally harder to train and less accurate.

artificial intelligence, machine learning, natural language, (17 more...)

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > Canada > Quebec > Montreal (0.04)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)

Neural Information Processing SystemsNov-20-2025, 22:52:53 GMT

Latent Alignment and Variational Attention

Neural attention has become central to many state-of-the-art models in natural language processing and related domains. Attention networks are an easy-to-train and effective method for softly simulating alignment; however, the approach does not marginalize over latent alignments in a probabilistic sense. This property makes it difficult to compare attention to other alignment approaches, to compose it with probabilistic models, and to perform posterior inference conditioned on observed data. A related latent approach, hard attention, fixes these issues, but is generally harder to train and less accurate. This work considers variational attention networks, alternatives to soft and hard attention for learning latent variable alignment models, with tighter approximation bounds based on amortized variational inference. We further propose methods for reducing the variance of gradients to make these approaches computationally feasible. Experiments show that for machine translation and visual question answering, inefficient exact latent variable models outperform standard neural attention, but these gains go away when using hard attention based training. On the other hand, variational attention retains most of the performance gain but with training speed comparable to neural attention.

hard attention, latent alignment and variational attention, name change, (3 more...)

Technology: Information Technology > Artificial Intelligence > Natural Language (1.00)

Neural Information Processing SystemsNov-20-2025, 19:32:56 GMT

Latent Alignment and Variational Attention

Yuntian Deng, Yoon Kim, Justin Chiu, Demi Guo, Alexander Rush

Figure 1: Sketch of variational attention applied to machine translation.

artificial intelligence, machine learning, natural language, (18 more...)

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > Canada > Quebec > Montreal (0.04)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

arXiv.org Artificial IntelligenceOct-2-2025

The Transformer Cookbook

Yang, Andy, Watson, Christopher, Xue, Anton, Bhattamishra, Satwik, Llarena, Jose, Merrill, William, Ferreira, Emile Dos Santos, Svete, Anej, Chiang, David

We present the transformer cookbook: a collection of techniques for directly encoding algorithms into a transformer's parameters. This work addresses the steep learning curve of such endeavors, a problem exacerbated by a fragmented literature where key results are scattered across numerous papers. In particular, we synthesize this disparate body of findings into a curated set of recipes that demonstrate how to implement everything from basic arithmetic in feed-forward layers to complex data routing via self-attention. Our mise en place of formulations is for both newcomers seeking an accessible entry point and experts in need of a systematic reference. This unified presentation of transformer constructions provides a foundation for future work spanning theoretical research in computational complexity to empirical investigations in architecture design and interpretability.

artificial intelligence, machine learning, natural language, (20 more...)

2510.00368

Country:

Asia > Japan > Honshū > Tōhoku > Fukushima Prefecture > Fukushima (0.04)
North America > United States > Texas > Travis County > Austin (0.04)
North America > United States > Pennsylvania (0.04)
(3 more...)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

Strobl, Lena, Angluin, Dana, Frank, Robert

Concise One-Layer Transformers Can Do Function Evaluation (Sometimes)

arXiv.org Artificial IntelligenceMar-27-2025

While transformers have proven enormously successful in a range of tasks, their fundamental properties as models of computation are not well understood. This paper contributes to the study of the expressive capacity of transformers, focusing on their ability to perform the fundamental computational task of evaluating an arbitrary function from $[n]$ to $[n]$ at a given argument. We prove that concise 1-layer transformers (i.e., with a polylog bound on the product of the number of heads, the embedding dimension, and precision) are capable of doing this task under some representations of the input, but not when the function's inputs and values are only encoded in different input positions. Concise 2-layer transformers can perform the task even with the more difficult input representation. Experimentally, we find a rough alignment between what we have proven can be computed by concise transformers and what can be practically learned.

artificial intelligence, machine learning, natural language, (19 more...)

2503.22076

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
Information Technology > Artificial Intelligence > Natural Language (0.68)

arXiv.org Artificial IntelligenceMar-18-2025

Unique Hard Attention: A Tale of Two Sides

Jerad, Selim, Svete, Anej, Li, Jiaoda, Cotterell, Ryan

Understanding the expressive power of transformers has recently attracted attention, as it offers insights into their abilities and limitations. Many studies analyze unique hard attention transformers, where attention selects a single position that maximizes the attention scores. When multiple positions achieve the maximum score, either the rightmost or the leftmost of those is chosen. In this paper, we highlight the importance of this seeming triviality. Recently, finite-precision transformers with both leftmost- and rightmost-hard attention were shown to be equivalent to Linear Temporal Logic (LTL). We show that this no longer holds with only leftmost-hard attention -- in that case, they correspond to a \emph{strictly weaker} fragment of LTL. Furthermore, we show that models with leftmost-hard attention are equivalent to \emph{soft} attention, suggesting they may better approximate real-world transformers than right-attention models. These findings refine the landscape of transformer expressivity and underscore the role of attention directionality.

artificial intelligence, logic & formal reasoning, machine learning, (18 more...)

2503.14615

Country:

Asia > Thailand > Bangkok > Bangkok (0.04)
North America > United States > New York (0.04)
North America > Mexico > Mexico City > Mexico City (0.04)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning > Logic & Formal Reasoning (0.68)

arXiv.org Artificial IntelligenceDec-13-2024

Simulating Hard Attention Using Soft Attention

Yang, Andy, Strobl, Lena, Chiang, David, Angluin, Dana

We study conditions under which transformers using soft attention can simulate hard attention, that is, effectively focus all attention on a subset of positions. First, we examine several variants of linear temporal logic, whose formulas have been previously been shown to be computable using hard attention transformers. We demonstrate how soft attention transformers can compute formulas of these logics using unbounded positional embeddings or temperature scaling. Second, we demonstrate how temperature scaling allows softmax transformers to simulate a large subclass of average-hard attention transformers, those that have what we call the uniform-tieless property.

artificial intelligence, machine learning, natural language, (19 more...)

2412.09925

Country:

Asia > Middle East > Jordan (0.04)
Europe > Sweden > Västerbotten County > Umeå (0.04)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Neural Information Processing SystemsOct-8-2024, 01:41:46 GMT

Reviews: Latent Alignment and Variational Attention

Update based on author rebuttal: I believe the authors have addressed the main criticisms of this paper (not clear how it's different from prior work) and have also provided additional experiments. I've raised my score accordingly. This paper focuses on using variational inference to train models with a "latent" (stochastic) attention mechanism. They consider attention as a categorical or dirichlet random variable, and explore using posterior inference on alignment decisions. They test various approaches on NMT and visual question answering datasets.

attention-based model, domain structure, latent alignment and variational attention, (6 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.53)

Wu, Shijie, Shapiro, Pamela, Cotterell, Ryan

Hard Non-Monotonic Attention for Character-Level Transduction

arXiv.org Artificial IntelligenceFeb-20-2024

Character-level string-to-string transduction is an important component of various NLP tasks. The goal is to map an input string to an output string, where the strings may be of different lengths and have characters taken from different alphabets. Recent approaches have used sequence-to-sequence models with an attention mechanism to learn which parts of the input string the model should focus on during the generation of the output string. Both soft attention and hard monotonic attention have been used, but hard non-monotonic attention has only been used in other sequence modeling tasks such as image captioning (Xu et al., 2015), and has required a stochastic approximation to compute the gradient. In this work, we introduce an exact, polynomial-time algorithm for marginalizing over the exponential number of non-monotonic alignments between two strings, showing that hard attention models can be viewed as neural reparameterizations of the classical IBM Model 1. We compare soft and hard non-monotonic attention experimentally and find that the exact algorithm significantly improves performance over the stochastic approximation and outperforms soft attention. Code is available at https://github. com/shijie-wu/neural-transducer.

alignment, computational linguistic, proceedings, (13 more...)